====== (2015) Ebola virus epidemic in West Africa  ======
====== A Minimal Model ======

Version 0.  Draft.  Please do not distribute.
 --- //[[rik.blok@ubc.ca|Rik Blok]] 2015-02-12 //

\(\require{AMSmath}\)

====== Abstract ======

  * inspiration
  * what I did
  * what I found:
    * model gives estimate of unknown infected
    * model gives estimate of how R_0 varied over time
    * model shows second surge in mid-2014 
    * model forecasts end of outbreak and final impact (cases/deaths)
    * model estimates case fatality rate in hospital.  Can be compared empirically.
  * ???
  * profit!

====== - Introduction ======

In the summer of 2014 I learned of the ongoing [[wp>ebola virus epidemic in West Africa]].  By September more people had died than in [[http://www.washingtonpost.com/news/world/wp/2014/09/03/who-ebola-death-toll-tops-1900-surpassing-all-previous-outbreaks-combined/| all previous outbreaks combined]].

How much worse was this tragedy to get?  When would it abate?  

At the time it wasn't obvious from the timeline whether the death toll would continue accelerating or not.

Some [[http://www.economist.com/news/international/21642242-why-projections-ebola-west-africa-turned-out-wrong-predictions-purpose| early forecasts were dire]].  [[http://www.wired.com/2015/05/one-year-later-ebolas-lessons-next-big-outbreak/ | Others]] wondered if the disease would become endemic.

Since then [[http://arxiv.org/find/all/1/all:+ebola/0/1/0/all/0/1 | some good research]] has come out but there was a lot of uncertainty in the fall of 2014.

There was a lot of concern that the [[https://en.wikipedia.org/w/index.php?title=Ebola_virus_epidemic_in_West_Africa&oldid=647444600#Data_sources | data sources were unreliable]] and that the actual number of cases were much higher than reported.

Clearly it was important to have good forecasts to predict the scope of this tragedy and help raise awareness.

Forecasts could also be used to test effectiveness of intervention strategies.

I didn't expect to be able to make any meaningful contributions in these regards.

Rather, I hoped that the process of trying to construct a model would help me understand how models in this area are actually being designed and how useful they are.

In the following sections I will describe how I constructed a model of the ebola epidemic and used it to try forecast the final toll.


====== - Model ======

In this section I describe the mathematical model I constructed to study the ebola epidemic.

===== - Motivation and design choices =====

<blockquote>
A model is a lie that helps you see the truth.
<cite>
Howard Skipper (([[http://books.google.com/books?id=5rF_31RVTnMC&pg=PA139&lpg=PA139&dq=%22A+model+is+a+lie+that+helps+you+see+the+truth%22&source=bl&ots=ulZKs2Du77&sig=zHeywZdwjfGr2Jv5uBIGY3mw5V0&hl=en&sa=X&ei=2VMaUr2DOszFiwLo04HwDw&ved=0CFoQ6AEwCA | Mukherjee, Siddhartha. (2010). The Emperor of All Maladies: A Biography of Cancer. Simon and Schuster. p. 139.]]))
</cite>
</blockquote>

There are many choices to make when designing a model.  Each decision limits the scope of the model but hopefully makes it more useful.

My main goal was to understand how models in this field were designed and applied.  To keep it understandable I needed my model to be as simple as possible.

I also wanted my model to make quantitative predictions.  Therefore it needed to be quantitative, mathematical.

I wanted to see how things changed over time so my model needed to be dynamic (just meaning that the model deals with time-varying quantities).

In my experience models with discrete time steps are prone to artifacts (compare logistic equation vs map) so I chose a continuous time model.

I am particularly interested in emergence, such as how interactions between individuals can lead to surprising outcomes for the population as a whole.  Therefore I was motivated to construct a model describing individual interactions and using it to predict population dynamics, using a technique called reaction kinetics.

I considered building a spatially-structured model (so, for example, individuals might be located in a region and only interact with others nearby).  Space could have been represented as a network of connected nodes (eg. a node for each affected country) or as coordinates in a continuum.  Either approach would probably be more accurate than neglecting space since there are known "hot spots" in this epidemic.

But space often makes a model much harder to understand.  Can get complex effects, such as "rescue effect" you don't see in non-spatial models.

Also, I wasn't certain that space was necessary: it may be possible to capture the essence of the epidemic without it.  I am more concerned with developing a model that helps me understand the epidemic (even if it is only paints with broad strokes) than accurately reproducing the dynamics without understanding them.

Also adding space is akin to adding many more parameters to a model.  Then we risk "overfitting" the model to the data.  (Unless the data it's compared to is also spatial.)

Adding space to the model would also make the model computationally much more complicated.  It would be harder to validate so I would have less confidence if any interesting results I got were meaningful or just a side-effect of an undiscovered error.  The model would have to be "solved" on a computer and spatial structure could make it take much longer (depending on how fine-grained a spatial structure I chose).  

So, with the goal of making the model as simple as possible (but no simpler) I chose not to include a spatial component.  If the model was unable to reasonably represent the dynamics of the epidemic I might suppose that space was a necessary ingredient and consider building it into a new model.

With these thoughts in mind, I started by looking at the well-known [[wp>Compartmental_models_in_epidemiology#Bio-mathematical_deterministic_treatment_of_the_SIR_model | SIR model of epidemics]].  

==== Reaction kinetics ====

The [[wp>Compartmental_models_in_epidemiology#Bio-mathematical_deterministic_treatment_of_the_SIR_model | SIR model]] is a well-known compartmental model in epidemiology.

It simplifies the representation of individuals in a population by assigning them to one of three "compartments": Susceptible, Infected, or Removed.  All other details, such as age, gender, wealth, etc., are neglected.

This is of course a drastic oversimplification.

But it does have one big advantage: it makes the model much easier to understand and work with.

The SIR model is usually depicted as a [[wp>multi-compartment model]] as shown here:

<div center round 550px box>
<BOOKMARK:fig_1>
https://docs.google.com/drawings/d/1GbxaXDyrGUftpzmK4Ms15jOEZoVKhvjO74AxNrQzEnY
Figure 1: SIR as a compartmental model.
</div>

The population is split into three compartments: S, I, and R.  Individuals transition from the S compartment to I at a rate $\alpha S I$ and from I to R at rate $\phi I$ where the variables $S$, $I$, and $R$ denote the density of individuals in each compartment.  

The transition rates are not clearly motivated in this picture.  I prefer an alternative representation, reaction kinetics, that clarifies how the rates come about.  Instead of thinking of populations we look at the interactions that individuals are involved in.  For the SIR model above those processes are

\begin{eqnarray}
S + I & \xrightarrow{\alpha} & 2 I  & \text{(infection)} \\
I     & \xrightarrow{\phi}   & R.   & \text{(removal)} 
\end{eqnarray}

The first reaction shows an infection process, an interaction between a susceptible and an infected individual.  The parameter $\alpha$ indicates the strength of the interaction, the rate per S-I pair in the population.  The second reaction indicates removal of an infected individual, at rate $\phi$ per infected.

The compartmental and reaction kinetics formalism are mathematically equivalent.  Herein, I will use reaction kinetics to better explain the "microscopic" (individual-level) processes that motivate the model.

In the following section I will alter the SIR model to better represent the features I think are most important.

==== - IR Model ====

S is usually included in an SIR model to account for the loss of "targets" as an epidemic spreads, and the subsequent slowing.  

But that only makes sense if the epidemic spreads to a large proportion of those susceptible individuals.  

If only a small fraction are ever infected then we can treat the size of the S "pool" as effectively constant.

Even in the worst hit countries only 1% [citation needed] of the population had been affected so I was hopeful that this would be a valid assumption here.

With this assumption I chose to leave "S" out of my model.  So I had only an "IR model".

\begin{eqnarray}
I & \xrightarrow{\alpha} & 2 I  & \text{(infection)} \\
I & \xrightarrow{\phi}   & R.   & \text{(removal)} 
\end{eqnarray}

But there were some important factors to consider for the I and R-types if I wanted a reasonable representation of this epidemic.


==== - Subcompartments of I ====

I'm not sure it matters but I have an inkling that the latent period, after exposure but before onset of disease is important.  So I chose to include this in my model by adding an Exposed compartment.  Infected cases create new exposed cases that later transition to being infectious themselves.  So our model is:
\begin{eqnarray}
I & \xrightarrow{\alpha}  & I + E  & \text{(exposure)} \\
E & \xrightarrow{\lambda} & I      & \text{(infectious)} \\
I & \xrightarrow{\phi}    & R.     & \text{(removal)} 
\end{eqnarray}

Next we consider patients receiving treatment.  Treatment for ebola involves hospitalization for care and isolation.  

Evidence indicates that care improves survival rates.

Isolation helps prevent spread of the disease.

Isolation and care could be considered as two independent properties (so a patient might be isolated but not cared for, or vice versa) but for the sake of simplicity I will group them together.  I assume patients who are isolated also receive the benefit of care.

So I will split I into two subcompartments: 

  * I = infected but not identified.  Can spread infection and lower survival chances.

  * H = identified and hospitalized.  Infected but isolated and receiving treatment.  Assume can no longer spread infection and higher survival probability.  Contacts are also traced so expect very low transmission rate.  Might also argue that past contacts (traced) are less likely to spread disease but I didn't include in my model.

With these changes the reaction kinetics model becomes 
\begin{eqnarray}
I & \xrightarrow{\alpha}  & I + E  & \text{(exposure)} \\
E & \xrightarrow{\lambda} & I      & \text{(infectious)} \\
I & \xrightarrow{\phi}    & H      & \text{(hospitalization)} \\
I & \rightarrow           & R      & \text{(removal of infectious)} \\ 
H & \rightarrow           & R.     & \text{(removal of hospitalized)} 
\end{eqnarray}

One could argue that E should also be separated into known (via contact tracing) and unknown compartments but I didn't think of that.  So I just have one type E.

Note that we haven't specified the rates of removal yet.  We will discuss removal in more detail next.


==== - Subcompartments of R ====

In the SIR model R refers to "removed" individuals, those who are no longer infectious and are not susceptible to re-acquiring infection.  There are various ways an individual can be removed: two in particular are important for this ebola epidemic.

Individuals can recover or die.

Of course this is a hugely important distinction for the epidemic.  But it might not be obvious that it's important to track the different outcomes in the model.

It is important distinguish between recovery and death in the model because known deaths are recorded statistic so can be used to calibrate model.

But deaths of unknown infected people can't be tracked so the model doesn't distinguish between deaths versus recovery in these cases.

There have been allegations that traditional funeral rights contributed to the spread of the disease.  Other models have therefore included an infectious dead compartment, often F for funeral.  I didn't have any solid evidence or intuition one way or the other so I chose to leave this out for the sake of simplicity.  This would be an ingredient to add if the model wasn't realistic enough without it.

We've identified three compartments or important types to include in our model: E I H.

===== - Representation & analysis =====

==== - Rate equations ====

Above give a complete description of the processes I consider in my model.  Ignore all other factors.

Assuming a large, well-mixed population of individuals undergoing these reactions, law of mass action gives us the dynamical equations governing the population densities.

\begin{eqnarray}
\frac{dE}{dt} & = & \alpha I - \lambda E \\
\frac{dI}{dt} & = & \lambda E - (\phi + \rho_I + \delta) I \\
\frac{dH}{dt} & = & \phi I - (\rho_H + \delta) H.
\end{eqnarray}

We also track two other properties, the number of known cases and deaths, $C$ and $D$, so we can compare against available data:

\begin{eqnarray}
\frac{dC}{dt} & = & \phi I \\
\frac{dD}{dt} & = & \delta H.
\end{eqnarray}

Aside (put in a box?): we can see where these rates come from if we consider which processes might involve recording a new case C or death D.  (Show modified reactions.)  I assume cases are counted when they are hospitalized.

If we know the rate parameters and the initial numbers of each type we can use the rate equations to determine the population at any other time.

In practice it is a difficult problem but there are good mathematical tools to solve these equations numerically.  

Choosing the right tool is an art.  I found ode15s gives consistently good results.


==== - Fixed parameters ====

  * constant parameters & initial conditions

1/lambda =

1/rho_I = 

1/rho_H = 

delta = 

No information on alpha and phi so I fit those to the data.

Originally I fit the initial conditions as well.  

But I was worried this could lead to "garden of eden" scenarios.  Example: more deaths than cases.  

Amazingly, the index case of the current outbreak was identified.

So can set E_0, I_0, H_0, C_0, D_0 = ... on Dec 2, 2013.


==== - Solution for constant rates ====

If all of the above parameters are constant then it is possible to solve in closed-form for the population at any time.

Here's how.


==== - phi as a function of time ====

But there are reasons to think parameters are not all constant.

I don't know what might affect alpha so I will assume it is constant and fit it.

But I think it is easier to imagine that phi changed throughout the epidemic.

Initially, ebola was not suspected so no effective protective measures were taken.  Sick were cared for at home and infected their families.  Hospitalization and quarantine rates were low so phi initially small (but positive).

Eventually, ebola was identified and an international response was mounted.  Education happened and contact tracing.  These would have resulted in better identification of infected with better care and more effective isolation. Phi high.

  * strictly positive

=== - Sigmoid ===

  * originally sigmoid

=== - Bounded polynomial ===

  * later bounded polynomial
  * Gaussian envelope
    * polynomial inside
    * lowest order possible

=== - Stretched polynomial? ===

  * try this in the future?
  * conceptually even simpler

==== - Other model variations ====

  * also tried these variations: odemodel1..11
  * but they didn't fit the data as well at the time
  * haven't kept comparing them but maybe I should've
===== - Calibration =====

==== - Fitting C & D ====

  * data sources. Thanks.
  * parfmincon
  * logscale
  * don't worry about correlations in data


====== - Results ======

  * fit performed every day with latest data


===== - Fitted curves =====

  * animation of fitted curves
  * only started recording in December

===== - R0 =====

  * derivation
  * show it matches analytic solution for const phi (above)
  * plot over all time
  * discuss what it means: early, intermediate, late time
  * where it crosses one
  * sensitivity analysis?  Which parameter has biggest impact?  Especially when R_0 near one?


From [[http://www.nature.com.ezproxy.library.ubc.ca/nature/journal/vaop/ncurrent/full/nature14594.html | Carroll et al. (2015)]]:
<blockquote>
Phylogenetic analysis revealed the dynamic nature of the epidemic and molecular change in the viral sequence (Fig. 2a). Several distinct lineages were identified, with an initial lineage A (Figs 2a, 3 and Extended Data Fig. 2) linked to early Guinean cases dating from March 2014 including the three original viruses published by Baize et al.2. A second lineage, B, emerged in May and June and comprises all the sequences from Gire et al.6 and the remainder of those described here. As the epidemic expanded, lineage A remained confined in Guinea from March to June 2014, except for one sequence from 18 July 2014. A single Liberian sequence from March 2014 grouped within this lineage. No further EBOV genomes that we sequenced from samples taken after July 2014 belonged to lineage A. This clade was likely to have been associated with the original outbreak in Guinea and was almost successfully contained in May 2014 by the interventions of the multi-agency response.
</blockquote>

Do these separate lineages explain the slowdown and resurgence of the epidemic around March 2014?  I see $R_0$ declining until roughly March, then it explodes upwards.  The inflection point of $R_0$ is around February 2014, which corresponds to the start of Lineage B in Figure 3.

===== - Forecast =====

  * contained?
  * model estimates total number of cases (known and unknown)
  * forecast total deaths
  * forecast last death
  * compare with Valdez et al. (2015)

====== - Discussion ======

  * could use model to test interventions

===== - Conclusions =====

  * sympathy
  * benefits
    * learned a lot about curve fitting
    * learned about confidence intervals, especially with cumulative data.  Cite Nychka et al. (2000), Zwiers and von Storch (1995), King et al. (2014).
    * wrote and shared two routines: ''lhsdesigncon'' and ''parfmincon''
    * gained appreciation for importance and difficulty of true forecasting vs. pseudo-forecasting



====== References ======

====== Appendix A: MATLAB code ======

  * run on MATLAB 2014a, Windows 7
  * performance
  * other required files: parfmincon.m, nearestSPD.m, tick2text.m, export_fig.m

^ File                    ^ Description  ^
| testfitode5.m           |   |
| loadMontanaflynnData.m  |   |
| loadWikipediaData.m     |   |
| loadCmriversData.m      |   |
| odemodel12.m            |   |
| dispsection.m           |   |


